Packages

library(tidyr)
library(ggplot2)
library(dplyr)
library(ggcorrplot)
library(plotly)
library(sampling)
library(ggpubr)
library(manipulateWidget)

Dataset Details

This data set was retrieved from Kaggle. This dataset contains information on the relationship between personal attributes (age, gender, BMI, children dependents, smoking habits), geographic factors, and their impact on medical insurance charges.

Objective of analysis

It can be used to study how these attributes influence insurance costs.

Reading dataset

health_ins<-read.csv('C:\\Deepa\\Boston University\\Assignments\\MET CS 544\\Final Project\\Final Project dataset\\insurance.csv')
head(health_ins) 
##   age    sex    bmi children smoker    region   charges
## 1  19 female 27.900        0    yes southwest 16884.924
## 2  18   male 33.770        1     no southeast  1725.552
## 3  28   male 33.000        3     no southeast  4449.462
## 4  33   male 22.705        0     no northwest 21984.471
## 5  32   male 28.880        0     no northwest  3866.855
## 6  31 female 25.740        0     no southeast  3756.622

Convert Categorical Variables to Factors for analysis

health_ins$sex<-as.factor(health_ins$sex)
health_ins$smoker<-as.factor(health_ins$smoker)
health_ins$region<-as.factor(health_ins$region)
health_ins$children<-as.factor(health_ins$children)

Summary statistics for the data set, providing information like mean, median, quartiles, and counts for each variable.

summary(health_ins)
##       age            sex           bmi        children smoker    
##  Min.   :18.00   female:662   Min.   :15.96   0:574    no :1064  
##  1st Qu.:27.00   male  :676   1st Qu.:26.30   1:324    yes: 274  
##  Median :39.00                Median :30.40   2:240              
##  Mean   :39.21                Mean   :30.66   3:157              
##  3rd Qu.:51.00                3rd Qu.:34.69   4: 25              
##  Max.   :64.00                Max.   :53.13   5: 18              
##        region       charges     
##  northeast:324   Min.   : 1122  
##  northwest:325   1st Qu.: 4740  
##  southeast:364   Median : 9382  
##  southwest:325   Mean   :13270  
##                  3rd Qu.:16640  
##                  Max.   :63770

Data frame created by grouping the health_ins data frame by the ‘sex’ column calculating the count of each group

df<- health_ins %>% group_by(sex) %>% summarise(count=n())

Insurance coverage among Males and Females

Analyzing distribution of insurance coverage among males and females in the data set, offering insights into the gender-based patterns of insurance enrollment.

Below plot provides a clear visual comparison of the number of insured individuals in each gender category. The number of females and males are almost equal, males being slightly higher than females in the data set

a<-ggplot(data=df, aes(x=sex, y=count, fill=sex)) +
  geom_bar(stat="identity",width=0.3) + 
  scale_y_continuous(breaks=seq(0,700,200))  +
  theme_minimal()+
  scale_x_discrete(labels = c("Female", "Male")) +
  theme(legend.position="right")+
  ggtitle("Barplot of Frequency of Females and Males")+
  theme(plot.title = element_text(hjust=0.5))+
  labs(x = "Gender", y = "Count") 

ggplotly(a)

Age wise count of people with insurance coverage

The histogram illustrates the distribution of individuals with insurance coverage across different age groups. Notably, the number of insured individuals in their 20s surpasses that of other age categories, with the count in the 60s being the lowest. This trend suggests that people in their 20s may demonstrate a greater propensity to obtain insurance, potentially influenced by factors such as heightened awareness of health and financial planning.

b<-ggplot(data=health_ins, aes(x=age)) +
  geom_histogram(fill = "blue",bins=40, colour='white') +
labs(x ='Age', y='Count', title = 'Age wise Count') +
  theme_minimal() +
  theme(plot.title = element_text(hjust=0.5))

ggplotly(b)

Groups the ‘health_ins’ dataframe by the ‘sex’ and ‘smoker’ columns. Calculates the number of individuals for each combination of ‘sex’ and ‘smoker’ groups.

df1 <- health_ins %>%
  group_by(sex, smoker) %>%
  summarise(count = n())

Analysis of proportion of male and female smokers in the ‘health_ins’ data set.

Bar plot depicts the percentage of male and female smokers in the data set. The analysis shows that proportion of smokers in male category are higher than the females

c<-ggplot(df1, aes(x = sex, y = count, fill = smoker)) +
  geom_bar(stat = "identity", position = "fill") +
  labs(title = "Counts of Male and Female Smokers",
       x = "Smoker",
       y = "Percentage") +
  theme_minimal() +
  theme(plot.title = element_text(hjust=0.5))+
  scale_fill_manual(values = c("yes" = "pink", "no" = "blue")) +
  scale_y_continuous(labels = scales::percent_format())

ggplotly(c)

Analysis of Gender & Smoker-wise distribution of charges

The below is the Box plot distribution of insurance charges in the ‘health_ins’ data set, stratified by gender (‘sex’) and further differentiated by smoking status (‘smoker’). It allows for a quick comparison of central tendencies, variability, and the presence of outlier in charges across these groups As observed from the box plot the insurance charges are expectedly higher for smokers as compared to non smokers. Also interesting median charges of male smokers is comparatively higher than female smokers There are outliers in non smokers that could be because of other factors like dependents or bmi or region

plot_ly(data=health_ins, y=~charges,type='box',x=~sex,color=~smoker,colors = c('red','green'))%>% 
  layout(boxmode = "group", 
         xaxis = list(title='Gender'), 
         yaxis = list(title='Charges'), legend = list(title = list(text = "<b>Smoker</b>")))

Correlation matrix of variables age, bmi and charges.

Correlation matrix shows the pairwise correlation coefficients between these variables

corr<-cor(health_ins[,c(1,3,7)])
corr
##               age       bmi   charges
## age     1.0000000 0.1092719 0.2990082
## bmi     0.1092719 1.0000000 0.1983410
## charges 0.2990082 0.1983410 1.0000000

Analysis of correlation between age, bmi and charges through correlation matrix plot

The correlation coefficient between ‘age’ and ‘bmi’ is approximately 0.109 indicating a weak positive correlation between them.As age increases, BMI tends to slightly increase.

The correlation coefficient between ‘bmi’ and ‘charges’ is approximately 0.198 indicating a weak positive correlation between them. As bmi increases, charges tends to slightly increase.

The correlation coefficient between ‘age’ and ‘charges’ is approximately 0.299.This indicates age and charges are moderately correlated and as age increases charges tend to increase.

e<-ggcorrplot(corr,lab=T)
ggplotly(e)

Analysis of distribution of insurance charges in the dataset.

The histogram reveals a predominance of individuals with insurance charges around $20,000, suggesting a right-skewed distribution. This implies that a greater proportion of people are paying lower insurance charges compared to those with higher costs. This aligns with common sense, as those with elevated insurance charges may be influenced by factors such as smoking, older age, and geographical location.

options(scipen=999)

x<-ggplot(data=health_ins, aes(x=charges)) +
  geom_histogram(fill = "blue",bins=40, colour='white',aes(y=..density.. )) +
  labs(x ='Charges', y='Density', title = 'Insurance charges') +
  theme_minimal() + geom_density(linetype='dashed', lwd=1.2)+
  theme(plot.title = element_text(hjust=0.5))

ggplotly(x)

Analysis of Distribution of Charges by Region and Smoking Status

Distribution of charges across different regions, with a distinction between smokers and non-smokers. The median charges are highest in southeast for smokers followed by southwest. This shows that insurance charges in southern region is high as compared to northern regions. Interestingly, for non-smokers, the charges are in almost same range across regions

plot_ly(data=health_ins, y=~charges,type='box',x=~region,color=~smoker, colors = c('red','green'))%>% 
  layout(boxmode = "group", 
         xaxis = list(title='Region'), 
         yaxis = list(title='Charges'), legend = list(title = list(text = "<b>Smoker</b>")))

Central Limit Theorem

The central limit theorem (CLT) states that the distribution of sample means approximates a normal distribution as the sample size gets larger, regardless of the population’s distribution. Using the charges attribute in this data set the applicability of the central limit theorem can be shown. There is a right-skewed distribution of insurance charges in the data set as displayed in the insurance charges histogram Below are histograms showing the sample means of 1000 random samples of sample sizes 10, 30, 60, and 80 that follow a normal distribution. As sample size increases the distribution tends to be more normally distributed.

Central Limit Theorem simulation

Number of samples and different sample sizes

options(scipen=999)
sample<- 1000
sample.size1<- 10
sample.size2<- 30
sample.size3<- 60
sample.size4<- 80

#variable to store sample means for different sample sizes
xb1<- numeric(sample)
xb2<- numeric(sample)
xb3<- numeric(sample)
xb4<- numeric(sample)

set.seed(2698) # Set seed for reproducibility

# Generate samples and calculate means for each sample size

for(i in 1:sample){
  xb1[i]<- mean(sample(health_ins$charges, sample.size1, replace = F))
  xb2[i]<- mean(sample(health_ins$charges, sample.size2, replace = F))
  xb3[i]<- mean(sample(health_ins$charges, sample.size3, replace = F))
  xb4[i]<- mean(sample(health_ins$charges, sample.size4, replace = F))
}

# Create histograms for each sample size

d1<-data_frame(val = xb1) %>%
  ggplot(., aes(val)) + 
  geom_histogram(fill="blue", aes(y=after_stat(density))) +xlab('Mean of charges')+ylab('density')+
  ggtitle('Distribution of mean of charges N=10') +xlim(0, 30000) +
  ylim(0, 0.00025)

d2<-data_frame(val = xb2) %>%
  ggplot(., aes(val)) + 
  geom_histogram(fill="red", aes(y=after_stat(density))) +xlab('Mean of charges')+ylab('density')+
  ggtitle('Distribution of mean of charges N=30') +xlim(0, 30000) +
  ylim(0, 0.00025)

d3<-data_frame(val = xb3) %>%
  ggplot(., aes(val)) + 
  geom_histogram(fill="green", aes(y=after_stat(density))) +xlab('Mean of charges')+ylab('density')+
  ggtitle('Distribution of mean of charges N=60')+xlim(0, 30000)

d4<-data_frame(val = xb4) %>%
  ggplot(., aes(val)) + 
  geom_histogram(fill="pink", aes(y=after_stat(density))) +xlab('Mean of charges')+ylab('density')+
  ggtitle('Distribution of mean of charges N=80') +xlim(0, 30000) +ylim(0, 0.0003)

ggarrange(d1,d2,d3,d4,nrow=2,ncol=2) 

# Print  statistics
cat('Population mean is:',round(mean(health_ins$charges),2))
## Population mean is: 13270.42
cat('Population standard devation is:',round(sd(health_ins$charges),2))
## Population standard devation is: 12110.01
cat('Mean of sample means of size N=10 is: ', round(mean(xb1),2))
## Mean of sample means of size N=10 is:  13255.76
cat('SD of sample means of size N=10 is: ',sd(xb1))
## SD of sample means of size N=10 is:  3786.774
cat('Mean of sample means of size N=30 is: ', round(mean(xb2),2))
## Mean of sample means of size N=30 is:  13226.56
cat('SD of sample means of size N=30 is: ',sd(xb2))
## SD of sample means of size N=30 is:  2193.839
cat('Mean of sample means of size N=60 is: ', round(mean(xb3),2))
## Mean of sample means of size N=60 is:  13166.96
cat('SD of sample means of size N=60 is: ',sd(xb3))
## SD of sample means of size N=60 is:  1557.94
cat('Mean of sample means of size N=80 is: ', round(mean(xb4),2))
## Mean of sample means of size N=80 is:  13239.42
cat('SD of sample means of size N=80 is: ',sd(xb4))
## SD of sample means of size N=80 is:  1296.041

Sampling methods

Sampling is the selection of a subset or a statistical sample of individuals from within a statistical population to estimate characteristics of the whole population. There are many different sampling techniques including simple random sampling, systematic sampling, and stratified sampling Simple Random Sampling Without Replacement (SRSWOR) is a method of sampling where every possible sample of a certain size has an equal chance of being selected, and once a unit is selected, it is not put back into the population. Systematic sampling is a method where samples are selected via a fixed interval. Stratified sampling is when the larger group of data is broken into smaller strata and then certain sizes are picked from each group.

We are analyzing the relationship between Body Mass Index (BMI), smoking status and insurance charges. The analysis is conducted on the entire population and using three different sampling methods as stated above. The sample size of 500 is being used The scatter plots and regression lines provide insights into how the relationship varies across different sampling techniques. As seen in Scatter plot, the charges proportionately increase with increase in bmi for smokers while this is not the case with non smokers. In non smokers charges are on lower side with few outliers that could be due to dependents or region they belong to.

a1<-ggplot(health_ins, aes(x = bmi, y = charges))+
geom_point(aes(color = smoker, shape = smoker))+
  geom_smooth(aes(color = smoker, fill = smoker), 
              method = "lm", fullrange = TRUE) +
  facet_wrap(~smoker) +
  stat_cor(aes(label = ..r.label..))+
  scale_color_manual(values = c("#00AFBB", "#E7B800"))+
  scale_fill_manual(values = c("#00AFBB", "#E7B800")) +
  theme_bw() + ggtitle("Population graph")

# Simple Random Sampling Without Replacement (SRSWOR)
set.seed(2698)
s <- srswor(500, nrow(health_ins))

sample.2 <- health_ins[s != 0, ]

b1<-ggplot(sample.2, aes(x = bmi, y = charges))+
  geom_point(aes(color = smoker, shape = smoker))+
  geom_smooth(aes(color = smoker, fill = smoker), 
              method = "lm", fullrange = TRUE) +
  facet_wrap(~smoker) +
  stat_cor(aes(label = ..r.label..))+
  scale_color_manual(values = c("#00AFBB", "#E7B800"))+
  scale_fill_manual(values = c("#00AFBB", "#E7B800")) +
  theme_bw()+ ggtitle("SRSWOR")

#Systematic sampling

N<-nrow(health_ins)
n<-500
k<-ceiling(N/n)

# random item from first group
r <- sample(k, 1)

# select every kth item

s<-seq(r, by = k, length = n)

set.seed(2698)

sample.3 <- health_ins[s, ]

b2<-ggplot(sample.3[!is.na(sample.3$age),], aes(x = bmi, y = charges))+
  geom_point(aes(color = smoker, shape = smoker))+
  geom_smooth(aes(color = smoker, fill = smoker), 
              method = "lm", fullrange = TRUE) +
  facet_wrap(~smoker) +
  stat_cor(aes(label = ..r.label..))+
  scale_color_manual(values = c("#00AFBB", "#E7B800"))+
  scale_fill_manual(values = c("#00AFBB", "#E7B800")) +
  theme_bw() + ggtitle("Systematic Sampling")

#Stratified sampling
set.seed(2698)
st.1 <- sampling::strata(health_ins, stratanames = c("region"),
                         size = rep(125, 4), method = "srswor",
                         description = TRUE)
## Stratum 1 
## 
## Population total and number of selected units: 325 125 
## Stratum 2 
## 
## Population total and number of selected units: 364 125 
## Stratum 3 
## 
## Population total and number of selected units: 325 125 
## Stratum 4 
## 
## Population total and number of selected units: 324 125 
## Number of strata  4 
## Total number of selected units 500
st.sample1 <- sampling::getdata(health_ins, st.1)

b3<-ggplot(st.sample1, aes(x = bmi, y = charges))+
  geom_point(aes(color = smoker, shape = smoker))+
  geom_smooth(aes(color = smoker, fill = smoker), 
              method = "lm", fullrange = TRUE) +
  facet_wrap(~smoker) +
  stat_cor(aes(label = ..r.label..))+
  scale_color_manual(values = c("#00AFBB", "#E7B800"))+
  scale_fill_manual(values = c("#00AFBB", "#E7B800")) +
  theme_bw()+ ggtitle("Stratified Sampling")

ggarrange(a1,b1,b2,b3, nrow = 2, ncol = 2)

paste0("Population has percentage of smokers",round(table(health_ins$smoker)[2]/nrow(health_ins)*100,2),' %',' and SWSWOR sample has percentage of smokers: ',table(sample.2$smoker)[2]/nrow(sample.2)*100,' %')
## [1] "Population has percentage of smokers20.48 % and SWSWOR sample has percentage of smokers: 24.6 %"
paste0("Population has percentage of smokers",round(table(health_ins$smoker)[2]/nrow(health_ins)*100,2),' %',' and Systematic sample has percentage of smokers: ',table(sample.3$smoker)[2]/nrow(sample.3)*100,' %')
## [1] "Population has percentage of smokers20.48 % and Systematic sample has percentage of smokers: 20 %"
paste0("Population has percentage of smokers ",round(table(health_ins$smoker)[2]/nrow(health_ins)*100,2),' %',' and stratfied sample has percentage of smokers: ',table(st.sample1$smoker)[2]/nrow(st.sample1)*100,' %')
## [1] "Population has percentage of smokers 20.48 % and stratfied sample has percentage of smokers: 17.6 %"

Data wrangling techniques

Top 2 Regions with Highest Average Charges

This contains information about the top two regions with the highest average charges.The dataset (health_ins) is grouped by the region and mean of the charges is calculated for each region The top two regions are selected.

top_2_region<- health_ins %>% group_by(region) %>% summarise(mean_charges=mean(charges)) %>% 
  arrange(-mean_charges) %>% filter(row_number()<=2)

top_2_region$region<- as.character(top_2_region$region)
## Region: southeast    Mean Charges: 14735.41 
## Region: northeast    Mean Charges: 13406.38

Average age of adults by number of dependents

This analysis helps explore the relationship between age, number of dependents, and gender in the given dataset

Data set is grouped by both children and sex variables.The mean of the age variable is calculated for each combination of children and sex. The results are spread to have separate columns for male and female mean ages

sex_avg_age<-health_ins %>% group_by(children,sex) %>% summarise(mean_age=round(mean(age),1)) %>% 
                  select(children, sex, mean_age) %>% spread(sex,mean_age)

#mean age for each children group is calculated
avg_age<- health_ins %>% group_by(children) %>% summarise(overall_age=round(mean(age),1))

#The results are joined to have a final table with mean ages for both sexes and the overall mean age
final_avg_age<- inner_join(sex_avg_age,avg_age)
final_avg_age
## # A tibble: 6 × 4
## # Groups:   children [6]
##   children female  male overall_age
##   <fct>     <dbl> <dbl>       <dbl>
## 1 0          38.3  38.5        38.4
## 2 1          39.5  39.4        39.5
## 3 2          40.5  38.4        39.4
## 4 3          42.2  41          41.6
## 5 4          42    36.6        39  
## 6 5          37    34.5        35.6

Conclusion:

The data set explores the relationship between insurance charges and how they are affected by various attributes. There are some observations in the data set that might not be in sync with reality as the author has taken into account specific attributes for analysis. There are social and environmental factors at play which might be the reason for data being skewed in some observations.Overall, the data reflects how personal and geographic attributes can affect health insurance charges.